SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering
Version information plays an important role in spreadsheet understanding,
maintenance, and quality improvement. However, end users rarely use version
control tools to document spreadsheet version information. Thus, the
spreadsheet version information is missing, and different versions of a
spreadsheet coexist as individual and similar spreadsheets. Existing approaches
try to recover spreadsheet version information by clustering these similar
spreadsheets based on spreadsheet filenames or related email conversations.
However, the applicability and accuracy of existing clustering approaches are
limited because the necessary information (e.g., filenames and email
conversations) is usually missing. We inspected the versioned spreadsheets in
VEnron, which is extracted from the Enron Corporation. In VEnron, the different
versions of a spreadsheet are clustered into an evolution group. We observed
that the versioned spreadsheets in each evolution group exhibit certain common
features (e.g., similar table headers and worksheet names). Based on this
observation, we proposed an automatic clustering algorithm, SpreadCluster.
SpreadCluster learns the criteria of features from the versioned spreadsheets
in VEnron, and then automatically clusters spreadsheets with similar
features into the same evolution group. We applied SpreadCluster to all
spreadsheets in the Enron corpus. The evaluation result shows that
SpreadCluster could cluster spreadsheets with higher precision and recall rate
than the filename-based approach used by VEnron. Based on the clustering result
by SpreadCluster, we further created a new versioned spreadsheet corpus
VEnron2, which is much bigger than VEnron. We also applied SpreadCluster to
two other spreadsheet corpora, FUSE and EUSES. The results show that
SpreadCluster can cluster the versioned spreadsheets in these two corpora with
high precision.
Comment: 12 pages, MSR 201
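The idea of grouping spreadsheets by shared features can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract says SpreadCluster learns its criteria from VEnron, whereas the sketch below assumes a fixed, hypothetical threshold of 0.6, an equal weighting of the two features, Jaccard similarity as the measure, and single-link merging via union-find. The function names and the example data are likewise invented for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as dissimilar."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_spreadsheets(sheets, threshold=0.6):
    """Group spreadsheets whose feature similarity reaches `threshold`.

    `sheets` maps a spreadsheet name to a dict holding two sets:
    'worksheets' (worksheet names) and 'headers' (table header strings).
    The threshold and the 50/50 weighting are illustrative assumptions.
    """
    parent = {name: name for name in sheets}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sheets, 2):
        sim = (0.5 * jaccard(sheets[a]["worksheets"], sheets[b]["worksheets"])
               + 0.5 * jaccard(sheets[a]["headers"], sheets[b]["headers"]))
        if sim >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in sheets:
        groups.setdefault(find(name), []).append(name)
    # Report only groups with at least two members (candidate evolution groups).
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical input: two versions of one budget file and one unrelated file.
EXAMPLE_SHEETS = {
    "budget_v1.xls": {"worksheets": {"Summary", "Q1"},
                      "headers": {"Item", "Cost", "Owner"}},
    "budget_v2.xls": {"worksheets": {"Summary", "Q1", "Q2"},
                      "headers": {"Item", "Cost", "Owner"}},
    "trades.xls": {"worksheets": {"Deals"},
                   "headers": {"Counterparty", "Volume"}},
}

if __name__ == "__main__":
    print(cluster_spreadsheets(EXAMPLE_SHEETS))
```

With the example data, the two budget files share all headers and two of three worksheet names, so they land in one group, while the unrelated file is left out.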
VEnron1.0
VEnron1.0 is an industrial-scale, public spreadsheet corpus with version information, including 360 evolution groups and 7,294 spreadsheets. (Multiple versions originating from the same spreadsheet are considered an evolution group.)
VEnron1.0 was published with our ICSE SEIP 2016 paper.
Wensheng Dou, Liang Xu, Shing-Chi Cheung, Chushu Gao, Jun Wei, Tao Huang. VEnron: A Versioned Spreadsheet Corpus and Related Evolution Analysis. In Proceedings of the 38th International Conference on Software Engineering (ICSE SEIP 2016), pages 162-171, Austin, TX, USA, May 2016.
VEnron2
VEnron2 is an industrial-scale, public spreadsheet corpus with version information, including 1,609 evolution groups and 12,254 spreadsheets. (Multiple versions originating from the same spreadsheet are considered an evolution group.)
VEnron2 is a substantial improvement over VEnron1.1. Using SpreadCluster, we extracted many more evolution groups and spreadsheets from the Enron email archive.
VEnron2 was published with our MSR 2017 paper in May 2017.
Liang Xu, Wensheng Dou, Chushu Gao, Jie Wang, Jun Wei, Hua Zhong, Tao Huang. SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR 2017), May 2017.
VEUSES
EUSES is the most frequently used spreadsheet corpus and contains 4,037 spreadsheets. These spreadsheets were extracted from the World Wide Web.
We applied SpreadCluster to EUSES and manually validated all groups. Based on the validated result, we built the VEUSES corpus, containing 177 evolution groups and 363 spreadsheets.
VEUSES was published with our MSR 2017 paper in May 2017.
Liang Xu, Wensheng Dou, Chushu Gao, Jie Wang, Jun Wei, Hua Zhong, Tao Huang. SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR 2017), May 2017.
VEnron1.1
VEnron1.1 is an industrial-scale, public spreadsheet corpus with version information, including 322 evolution groups and 7,171 spreadsheets. (Multiple versions originating from the same spreadsheet are considered an evolution group.)
VEnron1.1 is an improvement over VEnron1.0. We fixed some errors in VEnron1.0 and designed a simple layout structure to store versioned spreadsheets.
VEnron1.1 was published with our MSR 2017 paper in May 2017.
Liang Xu, Wensheng Dou, Chushu Gao, Jie Wang, Jun Wei, Hua Zhong, Tao Huang. SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR 2017), May 2017.
VFUSE
FUSE is a reproducible, internet-scale corpus that contains 249,376 unique spreadsheets extracted from over 26.83 billion pages.
We applied SpreadCluster to FUSE and manually validated 200 groups that were randomly selected from the clustering result. Based on the validated result, we built the VFUSE corpus, containing 188 evolution groups and 1,143 spreadsheets.
VFUSE was published with our MSR 2017 paper in May 2017.
Liang Xu, Wensheng Dou, Chushu Gao, Jie Wang, Jun Wei, Hua Zhong, Tao Huang. SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR 2017), May 2017.
Mining Vehicles Frequently Appearing Together from Massive Passing Records
Vehicles Frequently Appearing Together, or VFATs, can be clues in solving criminal cases. Traditional sequence mining approaches help identify VFATs from passing-through records collected at monitoring sites. However, huge traffic data streams hinder fast identification of VFATs. In this paper, we present a multi-threaded approach for fast identification of VFATs on multi-core processors, called Frequent Sequential Mining based on Multi-Cores (FSMMC). It parallelizes the execution of tasks, partitions large volumes of data, and obtains VFATs by merging local candidates discovered in different threads running on different processor cores. Through local parallel reduction, FSMMC eliminates repetitive patterns and reduces computational effort. Moreover, it achieves workload balance by dynamically distributing tasks to a pool of threads, where a thread that finishes first joins another thread that is still running. Both theoretical analysis and case studies show that FSMMC takes full advantage of multi-core computing platforms and achieves a higher speed-up when searching for VFATs among massive passing-through records than approaches without multithreading.
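The partition, mine locally, then merge pattern described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not FSMMC itself: it counts co-occurring vehicle pairs rather than mining full sequences, it partitions by monitoring site, and the record layout `(site, timestamp, plate)`, the 60-second window, the `min_sites` support threshold, and all function names are hypothetical. It also relies on Python's `ThreadPoolExecutor` to hand tasks to idle threads; in CPython, CPU-bound threads generally do not run in parallel across cores, so the sketch shows the structure rather than the speed-up.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def local_pairs(records, window):
    """Find vehicle pairs passing one site within `window` seconds of each other.

    `records` is a list of (timestamp, plate) tuples for a single site.
    Each pair is counted at most once per site (local duplicate removal).
    """
    records = sorted(records)
    pairs = set()
    start = 0
    for i, (t, plate) in enumerate(records):
        # Advance the window start past records that are too old.
        while records[start][0] < t - window:
            start += 1
        for _, other in records[start:i]:
            if other != plate:
                pairs.add(tuple(sorted((plate, other))))
    return Counter(pairs)


def find_vfats(passing_records, window=60, min_sites=2, workers=4):
    """Return pairs seen together at `min_sites` or more sites.

    Records are partitioned by site, each partition is mined in a worker
    thread, and the local candidate counts are merged into a global count.
    """
    by_site = defaultdict(list)
    for site, ts, plate in passing_records:
        by_site[site].append((ts, plate))

    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda recs: local_pairs(recs, window),
                           by_site.values())
        for local in results:
            total.update(local)  # merge local candidates
    return sorted(pair for pair, n in total.items() if n >= min_sites)


# Hypothetical passing records: (site, timestamp in seconds, plate).
EXAMPLE_RECORDS = [
    ("S1", 0, "A"), ("S1", 10, "B"), ("S1", 500, "C"),
    ("S2", 1000, "B"), ("S2", 1020, "A"), ("S2", 1030, "C"),
    ("S3", 2000, "C"), ("S3", 2400, "A"),
]

if __name__ == "__main__":
    print(find_vfats(EXAMPLE_RECORDS))
```

In the example data, vehicles A and B pass sites S1 and S2 within the window of each other, so they form the only pair that meets the two-site threshold.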